The dataset that is analyzed for this project is airline on-time performance data. The dataset used here is a subset of the full dataset which is available on this website:
http://stat-computing.org/dataexpo/2009/
This subset includes 123,496 flights from 1987 until 2008 from 29 different airlines and 317 airports
Furthermore, in order to visualize the flights, additional airport data is obtained from:
http://openflights.org/data.html
The longitude and latitude of departure and destination airport are added to the first dataset.
Before starting the EDA, it is necessary to perform some data wrangling with the dataset. The first step is to create a Date variable from the parameters Year, Month, and Day.
Secondly, the values of Departure and Arrival Delay need to be checked if they make sense or could contain incorrect values. Here is the summary of the two parameters.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1036.000 -2.000 0.000 8.184 6.000 1438.000 2349
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -64.000 -7.000 0.000 7.119 11.000 1014.000 2655
The maximum departure delay is 1438 min, i.e. 24h. The arrival delay for this flight, however, is -8 min:
## Dest Origin Year Month DayofMonth DayOfWeek DepTime CRSDepTime
## 89180 BWI MCO 1991 3 12 2 1900 1902
## ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
## 89180 2102 2110 DL 746 <NA> 122
## CRSElapsedTime AirTime ArrDelay DepDelay Distance TaxiIn TaxiOut
## 89180 128 NA -8 1438 787 NA NA
## Cancelled CancellationCode Diverted CarrierDelay WeatherDelay
## 89180 0 <NA> 0 NA NA
## NASDelay SecurityDelay LateAircraftDelay OriginLat OriginLon
## 89180 NA NA NA 28.42939 -81.30899
## DestLat DestLon Date
## 89180 39.17536 -76.66833 1991-03-12 12:00:00
Departing with an almost 24h delay and arriving 8 min early seems unrealistic. Looking at the scheduled and the actual depature time (19:02 and 19:00 respectively), it seems that the flight left in fact 2 minutes early and not delayed by 24h. The value is therefore corrected to the difference between scheduled and actual departure time.
The minimum departure delay is -1036 min (17h). Looking at the flight in question, it seems that the flight was scheduled for 8:40pm but left at 3:24 am (the following day):
## Dest Origin Year Month DayofMonth DayOfWeek DepTime CRSDepTime
## 94011 FLL BOS 2006 2 3 5 324 2040
## ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
## 94011 654 2359 B6 455 N523JB 1650
## CRSElapsedTime AirTime ArrDelay DepDelay Distance TaxiIn TaxiOut
## 94011 199 1634 415 -1036 1237 4 12
## Cancelled CancellationCode Diverted CarrierDelay WeatherDelay
## 94011 0 0 14 0
## NASDelay SecurityDelay LateAircraftDelay OriginLat OriginLon
## 94011 11 0 390 42.36435 -71.00518
## DestLat DestLon Date
## 94011 26.07258 -80.15275 2006-02-03 12:00:00
So, the departure delay should actually be 404 min. Also the parameter ActualElapsedTime (1650 min = 27.5h) is incorrect and should be 210 min.
Similary, the second highest departure delay of 1433 min is incorrect, as the scheduled departure time was 7:05pm and the actual departure time 6:58pm. So, the departure delay in the data set assumes a delay of almost 24h whereas the flight actually left 7 min early:
## Dest Origin Year Month DayofMonth DayOfWeek DepTime CRSDepTime
## 109568 ORF CVG 1998 12 7 1 1858 1905
## ArrTime CRSArrTime UniqueCarrier FlightNum TailNum
## 109568 2018 2030 DL 1040 N282WA
## ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Distance
## 109568 80 85 64 -12 1433 485
## TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay
## 109568 4 12 0 <NA> 0 NA
## WeatherDelay NASDelay SecurityDelay LateAircraftDelay OriginLat
## 109568 NA NA NA NA 39.04884
## OriginLon DestLat DestLon Date
## 109568 -84.66782 36.89461 -76.20122 1998-12-07 12:00:00
The remaining departure delays and all arrival delays seem to be correct.
Furthermore, the parameter DeltaTime is introduced as the difference between the actual elapsed time and the scheduled elapsed time.
Lastly, the parameter Route is created from the parameters Origin and Dest.
The first step is to look at departure and arrival delays, as this is what the dataset is about. Therefore, histograms of both departure and arrival delay are plotted. Also, the difference between scheduled and actual flight time is plotted as a histogram. Finally, a histogram of the newly created date variable is shown.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -60.000 -2.000 0.000 8.172 6.000 1014.000 2349
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -64.000 -7.000 0.000 7.119 11.000 1014.000 2655
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -216.0000 -8.0000 -2.0000 -0.9985 4.0000 266.0000 2655
The histograms of delays are negatively skewed; the departure delay more than the arrival delay. This makes sense as a negative departure delay indicates a departure time before the scheduled departure time, which does not happen very often.
Most flights depart with a delay between -10 and +25 min.
Also, the arrival delay is negatively skewed but not as much as the departure delay. This also makes sense, as an early arrival is more likely than an early departure. But due to airtraffic constraints an early arrival is not always possible, especially at bigger hubs, so it is not surprising to see more delays than early arrivals.
Most flights arrive with a delay between -20 and +40 min.
It can already be seen that more flights depart on time than arrive on time; the arrival delay distribution is more spread out than the departure delay distribution.
The delta flight time is more normally distributed, with a few outliers in the positive range.
The question therefore is whether the arrival delay can be directly attributed to the departure delay or if there are other parameters that impact arrival delay. This will be looked at in the bivariate section.
The date variable shows only few flights in 1987 (only October till December). Then from 1988 till 1998 the number of flights there is an approximately even number of flights per year. Following 1999 there is a strong increase (with a drop in 2002 and 2008).
Given the extreme outliers, a zoom is made for the departure and arrival delay and delta time histograms.
The next thing is to have a look at the number of flights per airline.
Delta Air Lines (DL) has the highest number of flights in the dataset, followed by Southwest (WN), American Airlines (AA), US Airways (US), and United Air Lines (UA).
Now the busiest departure and arrival airports are investigated by sorting the airports by origin and destination in descending order.
Due to the number of airports (317 different departure and destination airports), it is not possible to see which airports are the busiest. Therefore, subsets are created for airports with more than 2000 flights.
The three busiest airports in the dataset are:
The next step is to look at the most frequent routes.
Again, due to the high number of different routes (5686), it is impossible to see details, so a zoom on the routes which are frequented more than 250 times is made.
It can be seen that the busiest route is not out of the busiest airports but between San Francisco (SFO) and Los Angeles (LAX). LAX accounts for the 6 busiest routes (counting return flights as well) although it is only the 4th busiest airport. SFO is only number 10 in the airport list.
Finally, the number of flights per year, month, and day of the week are investigated.
In 1987, only flights between October 1 and December 31 are included in the dataset, whereas for the other years the entire year is available, hence the low number of flights in 1987.
It can be seen that the number of flights remained approximately constant until 1995. From then on, the number of flights continued to increase each year (with a drop in 2002).
Throughout the year, the number of flights varies little:
On weekends, there are less flights than during the working week. Especially on Saturdays, the number of flights are lower.
The dataset contains 123,496 flights from 29 different airlines. The flights took place between October 1, 1987 and December 31, 2008. Only domestic flights in the US are included; no international flights or international carriers are in this dataset.
The flights were performed between 317 different airports on a total of 5686 different routes.
The dataset includes 36 variables (29 in the original dataset and 7 were added during the EDA).
It has to be noted that this dataset is only a subset; the full dataset for this time frame contains 120 million flights.
The main features are departure and arrival delay, as they define the airline on-time performance. It is interesting to see what drives the delays.
Other interesting features are the airline, the origin and destination, and the time, as they all could have an impact on the on-time performance:
A date variable was created from the existing parameters Year, Month, and Day. Furthermore, origin and destination latitude and longitude were added from another dataset. Lastly, the parameter Route was created from the parameters Origin and Dest.
It was found that some delays had incorrect values and needed to be corrected. For details, please refer to the previous Section Data Wrangling.
The first step in the bivariate analysis is to create a scatter matrix of selected parameters. Please note that only the following parameters are kept:
Several interesting features need to be mentioned:
A second scatter matrix is produced with:
The aim is to see if there are regions with higher delays than others.
The second scatter matrix does not indicate any other correlation other than the one already described previously.
As seen in the first scatter matrix, it is interesting to see if and how departure and arrival delay are correlated. Therefore, scatterplots of arrival vs. departure delay are produced.
A zoom is done on smaller delays (please note that a log scale does not work here, as there are negative delays that could not be captured on a log scale).
These two plots show the linear trend between departure and arrival delay. They also show that there is a bigger scatter when the departure delay is small.
The next plot shows the actual flight time vs. the scheduled flight time.
It can be seen that the majority of flights has a scheduled time between 25 and 400 min. Fewer flights have scheduled times of up to 650 min (~11h).
The interesting part about this plot is that the deviation from the scheduled time tends to decrease for longer flights.
Now, the arrival delay is plotted versus the Delta Elapsed Time (this showed a moderate correlation in the first scatter matrix).
A zoom is also made to exclude outliers.
The linear trend that can be seen here are the on-time departures. The following plot only shows flights which leave between -2 and +2 minutes of their scheduled departure time (41,076 flights).
In the previous section, the busiest airports and airlines were determined. Now, it will be interesting to see if these airports also have the largest delays, i.e. if there is a correlation between number of flights in and out of an airport and the delay.
It can be seen that the airport with the largest departure and arrival delay is Newark Liberty International Airport (EWR), which is not the busiest airport (see the univariate section). For the departure delay the second and third highest delay is at Chicago O’Hare (ORD) and Hartfield-Jackson Atlanta (ATL) which are the two busiest airports. Dallas/Fort Worth (DFW), number 3 of busiest airports, is only on rank 18 and 9 for mean departure and arrival delays respectively. The bar plots alone do not allow seeing if there is a correlation.
Therefore, to check this in more detail, the average delay per airport is plotted versus the number of flights in and out of this airport.
This shows that there is no clear correlation between number of flights and average delay at a given airport; there must be other factors influencing the delay.
The next step is to see if the delay varies from airline to airline or if all airlines have similar delays.
The three airlines with the maximum departure delay are:
The maximum arrival delay is for the following carriers:
Interestingly, all airlines have positive delays except Hawaiian Airlines (HA).
Similarly to the airports, the busiest airlines are not the ones with the largest delays. This is also confirmed by the following plot (average delay vs. number of flights for each airline).
In order to go into more detail per airline, it is interesting to look at statistics other than the mean. Therefore, boxplots are created for each airline.
The plots show that a lot of points are outliers (outside 1.5 IQR). So, the same plots are created zoomed in. Here, the mean delay is also plotted.
This shows that the average delay per airline is ~8 min. The median delay however is very close to zero. Another interesting fact is that the lower and upper quartile (i.e. 50% of the flights) enclose departure delays of approximately -5 to +12 min and arrival delays of -12 to +18min. This fact is interesting as the connection time between two domestic flights is usually around 40 to 90 min. So, depending on the airport size (time between terminals/gates) the arrival delay could be acceptable in most cases.
The next step in this section is to see how the delays vary over time, i.e. delay per year, month, and day of the week.
The drop after 2001 suggests a link to the 9/11 attacks. Therefore, in the following plots, only flights from 2001 (the entire year) and September 2001 (month only) are shown.
Interestingly, on September 11, 2001 the mean arrival delay is negative. Following the 9/11 attacks, the FAA first stopped all civilian aircraft from taking off and then ordered all civilian aircraft to land.
The air space was then reopened on September 14. Both mean arrival and departure delay are very high that day, most probably due to the fact that aircraft needed to be repositioned after being diverted and/or grounded.
The strongest relationship that was found is between departure and arrival delay. It seems that a departure delay is almost always carried over and cannot be recovered. The analysis also showed that, especially for shorter flights, there must be other factors impacting the delays.
Longer flights tend to have less deviation from their scheduled flight time that short haul flights.
Some airlines seem to be more on-time than others. Most airlines have a similar medium and median departure and arrival delay, but some airlines (e.g. AQ = Aloha Airlines or HA = Hawaiian Airlines) have an average delay close to zero. The question is why this is the case: Are these airlines flying from and into less frequented airports? Are they operating in areas of the US that are not so much impacted by weather?
What is interesting is that the busiest airports/airlines do not have the largest delays. There is no correlation between number of flights and delays for neither airports nor airlines. Maybe it is a combination of both.
There seems to be a linear correlation between departure and arrival delay, especially at large delays. At shorter delays, there is more scatter and the relationship is not clearly linear anymore.
This could indicate that an arrival delay is likely to be caused by the late departure but not exclusively.
The previous section showed a strong linear correlation between arrival and departure delay. Now, the same plot is produced including coloring by airline.
Again the same plot but zoomed in on smaller delays.
There is no clear trend by airline; it seems that most airlines do suffer larger delays at some point.
The previous section showed a trend over time. Now the variation over the years and months are combined in a heat map, where year are along the x-axis and months are along the y-axis. The color indicates the average delay for that year/month.
To complement this illustration, the number of flights are also plotted in the same way.
The previous section showed that there is no correlation between number of flights and delays. Now, it is checked whether there is a correlation between:
To check the first point, the dataset is sorted by airline and airport and two heatmaps (departure and arrival delay) are created with the mean delay for each airport/airline pair.
Unfortunately, this does not allow a clear view of whether the airport/airline combination has an effect on delay. It can be seen that some airlines have larger delays at certain airports but there is no systematic view. The high number of blank fields is due to the fact that an airline does not serve all airports in the dataset.
Therefore, a subset is created for airport/airline pairs with more than 250 flights.
A similar plot is produced for the routes, i.e. a heatmap between origin and destination.
Unfortunately, due to the high number of airports (317), this plot does not reveal much. It cannot be concluded whether a certain route is subject to higher delays than others.
So again, a subset is created with pairs that have more than 125 flights.
Previously it was shown that flights out of Newark (EWR) have the highest mean delay. Here in the heat map, EWR pops up again.
So, a map is plotted with all destinations served out of Newark colored by the average departure delay.
Looking at the Top5 airports (sorted by mean departure delay in descending order) shows that the maximum delay is for flights to San Antonio International (SAT) followed by William P Hobby Airport, Houston (HOU) and Metropolitan Oakland International Airport (OAK):
## # A tibble: 5 × 10
## Dest mean_dep_delay mean_arr_delay median_dep_delay median_arr_delay
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 SAT 69.50000 146.00000 69.5 146.0
## 2 HOU 55.50000 59.00000 55.5 59.0
## 3 OAK 40.50000 95.00000 40.5 95.0
## 4 PHL 38.00000 42.00000 0.0 23.0
## 5 MCI 36.86957 17.04545 0.0 -11.5
## # ... with 5 more variables: OriginLon <dbl>, OriginLat <dbl>,
## # DestLon <dbl>, DestLat <dbl>, n <int>
All three airports are only served twice from EWR and in each case only one flight was heavily delayed.
## Origin Dest DepDelay UniqueCarrier
## 9635 EWR OAK 81 CO
## 9667 EWR OAK 0 CO
## 30734 EWR SAT -2 CO
## 30938 EWR SAT 141 CO
## 81630 EWR HOU 105 CO
## 81858 EWR HOU 6 CO
The question is therefore whether this sample is really representative of all flights from Newark to these destinations (keep in mind that this is a reduced dataset). Incidentally, all flights in question were performed by Continental Air Lines (CO).
Looking at destinations out of Newark with more than 20 flights shows that Kansas City (MCI) and Raleigh-Durham International (RDU) have average delays of 36.9 min and 36.4 min respectively. The plot of delay vs. number of flights is also created.
## # A tibble: 5 × 10
## Dest mean_dep_delay mean_arr_delay median_dep_delay median_arr_delay
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 MCI 36.86957 17.04545 0.0 -11.5
## 2 RDU 36.36957 32.52174 0.5 1.5
## 3 SEA 22.50000 17.40909 1.0 -3.0
## 4 PIT 20.32759 21.32759 0.0 5.0
## 5 JAX 19.65517 16.27586 0.0 4.0
## # ... with 5 more variables: OriginLon <dbl>, OriginLat <dbl>,
## # DestLon <dbl>, DestLat <dbl>, n <int>
Similarly to before, no correlation between number of flights and delay is to be noted. The largest delays occur at low number of flights but this may be due to the fact that the low sample number is not representative of the entire population for that route.
The next two heat maps show the average delay and number of flights out of EWR for each airline/destination pair.
It can be seen that Continental (CO) serves most destinations out of Newark along with Expressjet Airlines (XE). The number of Continental flights are higher than the Expressjet flights. In terms of delay, Continental has a better performance out of Newark than Expressjet. This could be linked once again to the low number of flights and that this may not be representative.
With the available data it is difficult to pinpoint the exact reason for a delay. It can be shown that the delay is independent of number of flights at a given location or a given airline. It can be shown that if there is knowledge about external data (here the 9/11 attacks) delays can be explained.
The heat map of departure and arrival delay by month and year showed two interesting things. Firstly, it confirms that there is a seasonal variation of delay: higher delays around Christmas (December, January) and during summer holidays (June till August). Secondly, the heat map shows a drop after 1991 (unknown reason) and after 2001 (due to the 9/11 attacks).
Interestingly, the heat map shows that the minimum delay (both departure and arrival) occured in September 2002. The plot with the number of flights indicate that one year after 9/11 the number of flights dropped; there is a gap between August and October 2002.
Lastly, an interesting feature revealed by the heat map is that the number of flights after 2002 continue to increase as they did before. The 9/11 attacks had an impact on 2002, but the following years follow the pre-2001 trend.
The interesting thing about this plot is that departure and arrival delay show a strong linear correlation (correlation coefficient is 0.905). This means that once an aircraft takes off with a delay, the delay cannot be recovered in most cases, especially for larger departure delays. For shorter departure delays, the scatter increases, which means that other factors contribute to the arrival delay.
This plot is interesting as it shows the delay evolution over the years and throughout each year. It is possible to identify peak months such as December/January or summer holiday periods as well as the trend of increasing delays over the year. Lastly, drops like the one following the 9/11 attacks which do not follow the general trend over the years can be identified.
This plot is interesting as it shows the mean departure delay out of Chicago O’Hare International Airport for each destination in the available dataset. Despite being the busiest airport (i.e. highest number of flights) most flights have reasonable delays (< 20 min). Some destinations have high delays of 40-60 min, but the number of flights for these airports are low and hence it is questionable whether this sample is representative.
Another interesting thing to be mentioned here is that there is a gap from Montana/North Dakota down to the half of Texas where no airports are served from Chicago O’Hare.
Please note that there are some destinations outside of the contiguous US map. These destinations are Honolulu (HNL), Kahului (OGG), Anchorage (ANC), Cyril E. King Airport on Saint Thomas (STT), and San Juan in Puerto Rico (SJU).
The dataset was supposed to be a cleaned dataset. However, when looking at the data it became obvious that some data wrangling was necessary (see the above section). The cleaning was done for departure delay, arrival delay and elapsed time for flights that stood out. A more thorough data wrangling of all parameters was not possible due to limited time. Also, an automized cleaning of the parameters based on scheduled departure and arrival times would be better instead of the manual cleaning that was done here.
The dataset contains the variables CarrierDelay, WeatherDelay, NASDelay (NAS=National Air Space), SecurityDelay, and LateAircraftDelay. Only 34,244 flights out of 123,496 flights in the dataset contain values for these parameters. It would nonetheless be interesting to detail the delays for these flights to analyze the cause in more detail.
Also, it could be interesting to pick a few airports/airlines/routes with large/small delays and to see if there are factors that can be identified to be responsible for the delays. Especially including weather data would interesting; what are prevailing wind/precipitation etc.? In the analysis it was shown how the 9/11 attacks impacted delays in the US. It could also be interesting to add information about catastrophes (e.g. hurricane Katrina) or also strikes if relevant to air traffic (pilots, ATC, ground workers etc.).
The data used for this project is only a subset of the full dataset that is available (see link above). The full dataset contains almost 120 million flights for the same time frame from 1987 till 2008. It would therefore be interesting to see how representative this subset was or if the results are skewed due to the selection of the flights.